Borrowing ideas from the CAPM and the Fama-French model, we build our own five-factor model. Our factors are the market risk factor, the liquidity factor, the HML factor, the COVID-19 factor, and the news sentiment factor. We verify each factor using OLS regression and validate the model afterwards.
$r_i$ is the daily return of each stock and $r_f$ is the 1 month treasury bill rate.
We choose the top 70 weighted stocks from the components of S&P 500 Index and calculate their $r_i$ by using pct_change function.
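As a minimal sketch of the return calculation (the prices and dates below are made up, not scraped data), `pct_change` computes $r_t = (P_t - P_{t-1}) / P_{t-1}$:

```python
import pandas as pd

# Hypothetical adjusted closing prices for one stock over four trading days.
prices = pd.Series([100.0, 102.0, 99.96, 101.0],
                   index=pd.date_range("2019-11-25", periods=4, freq="B"))

# r_i: daily simple return, exactly what pct_change computes.
r_i = prices.pct_change().dropna()
```

The first return is (102 - 100) / 100 = 0.02; the first row is dropped because it has no prior price.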
Goal:
$r_i$: Scrape the daily adjusted closing prices of the 70 stocks from 2019.11.25 to 2020.11.25.
$r_f$: We scrape year 2019 and 2020 data separately and then combine them.
Data Source:
For Market Risk Factor :
The market risk factor was first introduced by the CAPM and further developed by University of Chicago professors Eugene Fama and Kenneth French in their three-factor model [1]. The market risk premium ($r_m - r_f$) is the difference between the expected return of the market and the risk-free rate. It compensates investors with an excess return for bearing the additional volatility of market returns over and above the risk-free rate.
In our design, we use the S&P 500 Index as a proxy for the market return and the 1-month Treasury bill rate as the risk-free return.
For Liquidity Factor :
A stock’s liquidity generally refers to how rapidly shares of a stock can be bought or sold without substantially impacting the stock price. Stocks with low liquidity may be difficult to sell and may cause you to take a bigger loss if you cannot sell the shares when you want to.
According to Danyliv and Bland (2014) [2], the liquidity measure is the volume traded multiplied by the closing price, divided by the whole-day price range from high to low, on a logarithmic scale. They use the price at the end of the trading period because it is the most accurate valuation of the stock at that time, and the full day's traded volume, assuming volume traded is a linear function of time. In this report, we use the same setting as Danyliv and Bland do [3]:
$$\text{Liquidity} = \ln{\frac{\text{Adjusted Close Price} * \text{Volume}}{\text{High Price} - \text{Low Price}}}$$
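The measure is straightforward to code; a minimal sketch with a hypothetical one-day quote (all numbers made up):

```python
import numpy as np

def liquidity(adj_close, volume, high, low):
    """Danyliv-Bland liquidity measure: ln(close * volume / (high - low))."""
    return np.log(adj_close * volume / (high - low))

# Hypothetical quote: close 100, one million shares traded, daily range 99-101.
liq_value = liquidity(100.0, 1_000_000.0, 101.0, 99.0)
```

Here the value is ln(100 · 1,000,000 / 2) = ln(5 × 10⁷) ≈ 17.73.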
For HML Factor :
High Minus Low (HML), also referred to as the value premium, is the spread in returns between value companies and growth companies. Historically, companies with high book-to-market ratios (value companies) have outperformed those with low book-to-market ratios (growth companies) [4].
This phenomenon can be explained within the Efficient Markets Hypothesis (EMH): value companies tend to have a higher cost of capital and greater business risk, which leads to a higher required rate of return and hence outperformance relative to growth companies.
Thus, the HML factor measures the value risk of a company. If a company has a high book-to-market ratio, i.e., it is a value company, the regression will show a positive loading on the HML factor, indicating that the company's return is partly attributable to the value premium.
For Sentiment Factor :
As [5] notes of the news sentiment factor, "Hundreds pieces of financial news are released on different media every day and every trader takes great efforts in keeping track of the latest news and updates trade calls accordingly." It is thus clear that news has a great impact on stock prices.
For Covid-19 Factor:
The unexpected COVID-19 pandemic has had a huge impact on real economic activity around the world. Many cultural and sporting events have been suspended. Governments have taken emergency measures, such as shutdowns for social distancing and investments in testing, quarantining suspected cases, and treating confirmed cases, to contain the disease.
Badar Nadeem Ashraf (2020) [6] found that stock markets responded negatively to the growth in COVID-19 confirmed cases: stock market returns declined as the number of confirmed cases increased. Based on his research, we take COVID-19 confirmed cases into account and set their growth rate as the last factor of our five-factor model.
Goal:
Scrape the S&P 500 Index and the 1-month T-bill rate from 2019.11.25 to 2020.11.25.
Data Source:
The S&P 500 Index is scraped from Yahoo Finance and the risk free rate is scraped from US Treasury-Daily Treasury Yield Curve Rates.
Scraping Process:
$r_m$: Because the Yahoo Finance interface is dynamic, we cannot scrape our whole chosen time range at once. Therefore, we separate the year into three periods and scrape each in turn.
$r_f$: We scrape year 2019 and 2020 data separately and then combine them.
For data cleaning part:
We use the iloc method to drop some unwanted rows and the pd.to_numeric method to convert some columns to numeric format.
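The cleaning pattern can be sketched as follows; the raw frame below is a made-up stand-in for a scraped table with a stray header row and string-typed numbers:

```python
import pandas as pd

# Hypothetical raw scrape: a duplicated header row on top, rates stored as strings.
raw = pd.DataFrame({"Date": ["Date", "11/25/19", "11/26/19"],
                    "1 Mo": ["1 Mo", "1.56", "1.58"]})

clean = raw.iloc[1:].copy()                   # iloc: drop the unwanted first row
clean["1 Mo"] = pd.to_numeric(clean["1 Mo"])  # cast the rate column to float
```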
For factor creating part:
$r_f$:
The daily percentage returns of the other variables are all in decimals, while $r_f$ is in percent (without the % sign), so we need to divide it by 100.
The 1-month Treasury bill rate is an annualised interest rate. To put everything on the same scale, we break the Treasury bill rate down to a daily rate by dividing by 365.
Then we construct $r_i - r_f$.
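In code, the two rescalings and the excess return look like this (the quoted rate and the daily return are hypothetical numbers):

```python
# Hypothetical quoted 1-month T-bill rate: 1.46, i.e. 1.46% annualised.
annual_pct = 1.46
r_f_daily = annual_pct / 100 / 365   # percent -> decimal, then annual -> daily

# Excess return r_i - r_f for a stock that gained 1.2% that day.
excess = 0.012 - r_f_daily
```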
The slope of the regression line is the coefficient of the market risk factor, $\beta$. The beta of a potential investment measures how much risk the investment adds to the market portfolio. If a stock is riskier than the market, it has a beta greater than one; a $\beta$ of less than 1 indicates that the stock's price is less volatile than the overall market.
The slope of AAPL is larger than 1, indicating it is more volatile than the market, while the slope of UPS is less than 1, indicating it is less risky than the fully diversified market portfolio.
Goal:
Scrape Adjusted Closing Price, Volume, High Price and Low Price for each stock we have chosen.
Data Source:
Scraping process:
Scrape the basic information of each stock from Yahoo Finance using the Python "pandas_datareader" package.
For factor creating part:
First, we calculate the mean liquidity of each stock, then take the median of these means. This gives us the median liquidity across all stocks.
Then we use this median to divide the stocks into two groups: high-liquidity and low-liquidity stocks.
Finally, we combine the dataframes of all 70 stocks and subtract the mean $R_i$ of the low-liquidity stocks from that of the high-liquidity stocks, giving the "liquidity" factor.
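The three steps above can be sketched in pandas; the tickers, liquidity values, and returns below are all made up for illustration:

```python
import pandas as pd

# Hypothetical mean liquidity per stock and two days of returns for four stocks.
mean_liq = pd.Series({"A": 18.2, "B": 17.9, "C": 16.5, "D": 16.0})
returns = pd.DataFrame({"A": [0.01, 0.02], "B": [0.03, 0.00],
                        "C": [-0.01, 0.01], "D": [0.00, -0.02]})

# Step 1-2: split stocks at the cross-sectional median of mean liquidity.
median_liq = mean_liq.median()
high = [s for s in mean_liq.index if mean_liq[s] >= median_liq]
low = [s for s in mean_liq.index if mean_liq[s] < median_liq]

# Step 3: Liquidity_t = mean return of 'H' stocks minus mean return of 'L' stocks.
liquidity_factor = returns[high].mean(axis=1) - returns[low].mean(axis=1)
```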
1. Classification of high and low liquidity stocks
Classification criteria
If $\overline{Liquidity}_{stock}\geq Liquidity_{median}$, the stock has high liquidity and is classified as 'H'.
If $\overline{Liquidity}_{stock}<Liquidity_{median}$, the stock has low liquidity and is classified as 'L'.
2. Calculate the 'Liquidity' factor based on groups
We can calculate the 'Liquidity' factor from the following formula:
$$Liquidity_t=\overline{return}^{High}_t - \overline{return}^{Low}_t$$

Scatter Plot
We compare the relation between liquidity and $(R_i-R_f)$ for two stocks and plot a scatter figure for each.
Comparing only the dependent variable against the liquidity factor, we can easily see that the two OLS fits are not perfect: neither shows strong linearity, and both figures contain many outliers.
Line Chart
We plot the high- and low-liquidity stock groups and observe how the liquidity of the two groups changes over this period.
From the chart above, the difference between the returns of high- and low-liquidity stocks is quite small, which indicates that liquidity may not be a robust factor for explaining a stock's expected return.
Goal:
Scrape PB ratio and daily closing price of 70 companies from 2019.11.25 to 2020.11.25.
Data Source:
The PB ratio and daily closing price are scraped from Yahoo Finance.
1. Classification of Value and Growth company
Index
For the sake of convenience, we use the PB ratio as a substitute for the book-to-market ratio. The PB ratio is the reciprocal of the book-to-market ratio and is widely used in financial analysis. We scrape the PB ratio directly from the Yahoo Finance website and select the most recent release as our criterion for classifying value and growth companies.
Classification criteria
If $PB_{company}\geq PB_{median}$, the company is high growth and is classified as 'H'.
If $PB_{company}<PB_{median}$, the company is low growth and is classified as 'L'.
If the company's book value is negative, meaning its liabilities exceed its assets, we classify it as 'L', because we do not consider a negative net asset position sustainable.
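The classification rules above can be sketched as follows; the tickers and PB values are made up, and "BAD" stands in for a company with negative book value (and hence a negative PB):

```python
import pandas as pd

# Hypothetical most-recent PB ratios for six companies.
pb = pd.Series({"AAPL": 30.0, "MSFT": 12.0, "JPM": 1.5,
                "XOM": 1.1, "AMZN": 20.0, "BAD": -2.0})

median_pb = pb.median()

def classify(pb_value: float) -> str:
    """Apply the H/L rules; negative book value is always 'L'."""
    if pb_value < 0:
        return "L"
    return "H" if pb_value >= median_pb else "L"

label = pb.apply(classify)
```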
2. Combine PB ratio with return data
3. Calculate HML based on the groups
We can calculate HML from the following formula:
$$HML_t=\overline{return}^{High}_t - \overline{return}^{Low}_t$$

Box Plot
Based on the HML computed above, we first draw a box plot to get a basic understanding of the distribution of this index.
From the image above, we find that the median of HML is slightly greater than zero, indicating that high-growth companies outperformed low-growth ones during the selected period, contrary to our assumption. This phenomenon can be explained by the following reasons [7]:
- The more scarce growth has become, the more investors have been prepared to pay for it.
- The falling interest rate environment has resulted in investors’ willingness to pay up for long-term growth.
- Monetary stimulus has boosted asset prices to inflated levels so that investors no longer care about valuation.
Furthermore, the current dominance of Covid-19 makes value companies, such as those in the travel, leisure, energy and banking sectors, look very unattractive.
Then, we can have a look at the comparison of mean daily return between the high and low growth companies.
Line Chart
A line chart of daily HML helps us see the change in HML explicitly. From the first image, daily HML in the American stock market became more volatile in 2020, which may result from the outbreak of COVID-19 widening the gap between growth companies and value companies. We can also look at the high-growth and low-growth companies separately in the second picture.
Goal:
Scrape Stock-related news text data
Data Source:
We scraped news text data from Benzinga, which was founded in 2010 and is headquartered in Michigan. Benzinga is a financial news and analysis service providing timely, actionable insights for investors. It also helps people improve their investing and trading results by providing superior market information, tools, and data.
Scraping Process:
We choose the top listed stocks in the S&P 500 and scrape their news text data from Benzinga using BeautifulSoup. (We tried Selenium and Scrapy first, but both failed.)
We combine the whole process into one function, Get_News_from_Benzinga(stock), which retrieves the historical news text data on Benzinga for any stock we want. (See below.)
Note: For each stock, we save the DataFrame into csv file.
1. Core Parts of Get_News_from_Benzinga( ) function
2. The main function that runs the above scraping function
3. Load and Concat all csv files
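The load-and-concatenate step can be sketched as follows; the file names and contents below are hypothetical stand-ins for the per-stock CSVs that Get_News_from_Benzinga() writes:

```python
import glob
import pandas as pd

# Simulate the per-stock CSV layout: one news file per ticker (names made up).
pd.DataFrame({"Title": ["a"], "Content": ["x"]}).to_csv("news_AAPL.csv", index=False)
pd.DataFrame({"Title": ["b"], "Content": ["y"]}).to_csv("news_MSFT.csv", index=False)

# Load every per-stock file and stack them into a single DataFrame.
frames = [pd.read_csv(path) for path in sorted(glob.glob("news_*.csv"))]
all_news = pd.concat(frames, ignore_index=True)
```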
1. Some Basic Cleaning
- Drop the 'Unnamed: 0' column: when we used to_csv to save the data scraped from Benzinga, the index column was saved as 'Unnamed: 0'; since pandas adds a fresh index automatically when we load the data, we drop this column.
- Remove null rows: the Content column contains some null values, so we remove those rows from our dataset.
- Deal with missing values in the Date and Title columns: recall that when scraping, we set Date=-1 and Title=None for pages we could not access; if any remain after dropna, we clean them all.
- Reset the index.

2. Some Text-Related Cleaning

Most of the text data are cleaned with the following steps:
- Remove punctuation
- Tokenization: converting a sentence into a list of words
- Remove stopwords
- Lemmatization/stemming: transforming any form of a word to its root word

We combine the above steps into one function named Data_Cleaning( ).
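The cleaning pipeline can be sketched in plain Python. This is a simplified stand-in for the report's Data_Cleaning() (which presumably uses NLTK); the stopword list is a tiny made-up sample and the "stemming" is a crude plural-stripping rule, not a real stemmer:

```python
import string

# Tiny stand-in stopword list (a real pipeline would use NLTK's list).
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and"}

def data_cleaning(text: str) -> list[str]:
    """Strip punctuation, lowercase, tokenize, drop stopwords, crudely stem."""
    text = text.translate(str.maketrans("", "", string.punctuation)).lower()
    tokens = [w for w in text.split() if w not in STOPWORDS]
    # Crude stemming: drop a trailing 's' from longer words.
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in tokens]

cleaned = data_cleaning("The markets are falling, and investors panic!")
```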
3. Factor creating: use VADER to create our news sentiment factor
VADER (Valence Aware Dictionary for sEntiment Reasoning) is a less resource-consuming sentiment analysis model that uses a set of rules to specify a mathematical model without explicitly coding it. VADER's resource-efficient approach helps us decode and quantify the emotions contained in streaming media such as text, audio or video. VADER is sensitive to both polarity (whether the sentiment is positive or negative) and intensity (how positive or negative the sentiment is). It incorporates this by assigning a Valence Score to each word under consideration.

Valence Score: a score assigned to a word by means of observation and experience rather than pure logic. According to Hutto and Gilbert (2014) [8], the valence score is measured on a scale from -4 to +4, where -4 stands for the most 'Negative' sentiment and +4 for the most 'Positive'. Intuitively, the midpoint 0 represents 'Neutral' sentiment, and this is indeed how it is defined.
Compound VADER Scores: The compound score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most extreme negative) and +1 (most extreme positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. As explained in the paper [8], researchers used below normalization. $$x=\frac{x}{\sqrt{x^2+\alpha}}$$ where $x = $ sum of valence scores of constituent words, and $\alpha =$ Normalization constant (default value is 15).
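The normalisation above is simple enough to verify directly; a minimal sketch of the formula with the paper's default $\alpha = 15$:

```python
import math

def vader_normalize(x: float, alpha: float = 15.0) -> float:
    """Map a summed valence score into (-1, 1): x / sqrt(x^2 + alpha)."""
    return x / math.sqrt(x * x + alpha)
```

For example, a single maximally positive word (x = 4) normalises to 4 / sqrt(31) ≈ 0.718, and the output approaches ±1 only as the summed score grows large.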
In NLTK VADER, sentiment is calculated using the SentimentIntensityAnalyzer class. Here we compute the mean compound VADER scores and wrap the code into a function named Compute_MeanCompoundVADER().
1. Count of news vs Dates ('Title')
We first create a barplot of the number of news published by date as follows:
As we see, the y-axis is the number of news items published each day while the x-axis is the date, which makes it hard to read each day clearly.
So, let's change the x-axis to display only several months, and arbitrarily choose one week to look at closely.
As we see, the news published by date is mainly concentrated on weekdays, which coincides with trading taking place on weekdays.
In fact, below is the count of news published for each weekday between 2019-11-25 and 2020-11-25.
2. Word Cloud ('Cleaned_Text')
From this Word Cloud, we can straightforwardly see what the main words are of our news contents.
3. For sentiment factor
2020 stock market crash (from Wikipedia): The 2020 stock market crash, also referred to as the Coronavirus Crash, was a major and sudden global stock market crash that began on 20 February 2020 and ended on 7 April.
We need to calculate the daily growth rate to represent 'Covid19' factor and set a time period when there are valid values of growth rate.
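The growth-rate calculation is the same pct_change pattern as before, applied to cumulative case counts (the counts and dates below are made up):

```python
import pandas as pd

# Hypothetical cumulative confirmed-case counts over five days.
cases = pd.Series([100, 150, 225, 270, 270],
                  index=pd.date_range("2020-03-01", periods=5))

# Daily growth rate; the first day has no prior value, so we drop it to keep
# only the period where the growth rate is defined.
covid = cases.pct_change().dropna()
```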
Line Chart
Line chart of daily growth rate can help us figure out the change of growth rate explicitly.
In this part, we reformat the data and merge them together for further processing. The process is as follows:
The factor data are as follows:
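The merge step can be sketched as aligning every factor series on the trading-date index; the frame names, column labels, and values below are hypothetical:

```python
import pandas as pd

dates = pd.date_range("2020-11-23", periods=3, freq="B")
mkt = pd.DataFrame({"Mkt-RF": [0.001, -0.002, 0.003]}, index=dates)
hml = pd.DataFrame({"HML": [0.0005, 0.001, -0.0004]}, index=dates)
liq = pd.DataFrame({"Liquidity": [0.002, 0.0, 0.001]}, index=dates)

# Inner join on the date index keeps only days present in every factor series.
factor_table = mkt.join([hml, liq], how="inner")
```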
1. Modelling
In this part, we use the factor data and the stock daily return data to run regressions. We run one regression per security, giving 70 regression results.
The general regression model is:
$R_i-R_f = constant + \beta_1 \cdot (R_m-R_f) + \beta_2 \cdot HML + \beta_3 \cdot Liquidity + \beta_4 \cdot COVID + \beta_5 \cdot Sentiment$
Besides, because COVID-19 data is only available after Jan. 24th, 2020, we set a dummy variable $D$ for the COVID factor. Thus, our regression model becomes:
$R_i-R_f = constant + \beta_1 \cdot (R_m-R_f) + \beta_2 \cdot HML + \beta_3 \cdot Liquidity + D \cdot \beta_4 \cdot COVID + \beta_5 \cdot Sentiment$
where $D = 0$ before COVID outbreak, and $D = 1$ after COVID outbreak.
2. Analysis
After running the regressions, we plot an overview of the p-values of the 70 linear regression models. In the heat map, we can see that the COVID growth rate and sentiment factors do not work well, and the liquidity factor also performs poorly; however, the market factor works quite well.
Then we run an F-test on each regression to see whether the regression as a whole is significant. Among the 70 regressions, all are significant at the 99% confidence level, which indicates that although some factors do not perform well, our model as a whole works well.
Besides, we plot the adjusted R-squared of the 70 regressions to see how comprehensively our variables explain the dependent variable. As shown in the following chart, most of the adjusted R-squared values are over 0.5, and a large part lie above 0.6. Thus, our regression model explains the variation in returns reasonably well.
In the following part, we validate the results of all our models. To keep things neat, we mostly use the stock 'AAPL' as an example.
1. Check Linearity
Firstly, we check linearity. As shown in the chart below, most of the points $(actual, predicted)$ lie near the line $y=x$. Thus, our model satisfies linearity. We also checked the linearity of the other regressions; they all satisfy it.
2. Check Multicollinearity
For multicollinearity, we also take AAPL as an example. We first use a heat map to visualize the correlations between factors. When the $VIF$ of a factor is larger than ten, there is potential multicollinearity, and when it is larger than 100, there is severe multicollinearity. For AAPL, we found that the $VIF$ values of the five factors are all less than two; they are:
Thus, there is no multicollinearity. We also checked the $VIF$ values of the other regressions and found they are all in the same situation as AAPL. As a result, our models satisfy the no-multicollinearity assumption.
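A VIF check can be sketched with statsmodels; the factor matrix below is synthetic and uncorrelated by construction, so every VIF should come out near 1:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic, mutually independent factor columns (names are illustrative).
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["Mkt-RF", "HML", "Liquidity"])

# VIF_j = 1 / (1 - R_j^2), where R_j^2 regresses factor j on the other factors.
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns)}
```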
3. Check Autocorrelation
In this part, we use a box plot to show the Durbin-Watson test results of all the regressions. In the box plot, more than 75% of the $DW$ statistics lie above 1.7, in the vicinity of 2.0, and the others lie between 1.45 and 1.7. Thus, we consider our models to satisfy the no-autocorrelation assumption.
4. Check Homoscedasticity
To check homoscedasticity, we plot the residuals of a regression to see whether they have relatively stable variance. We again take AAPL as an example and check the remaining regressions the same way. In the plot, the residuals mostly lie in the vicinity of the green line except for several outliers, which means they have relatively constant variance. Thus, homoscedasticity is satisfied. The residual plots of the other regressions show a similar pattern.
5. Check Normality
We use the Anderson-Darling test to check the normality of the residuals. For AAPL, we found that the distribution is skewed and its kurtosis differs from that of a normal distribution. For the other stocks, normality is also not satisfied. This result means further adjustments to our model may be needed.
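The test itself is available in SciPy; a sketch on synthetic heavy-tailed residuals (a stand-in for the real regression residuals), where normality should be rejected:

```python
import numpy as np
from scipy import stats

# Synthetic heavy-tailed residuals: a t-distribution with 2 degrees of freedom.
rng = np.random.default_rng(2)
residuals = rng.standard_t(df=2, size=300)

result = stats.anderson(residuals, dist="norm")
# critical_values[2] corresponds to the 5% significance level; reject normality
# when the A-D statistic exceeds it.
reject_normality = bool(result.statistic > result.critical_values[2])
```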
To briefly recap, our regression model works well from a general perspective, but some of the factors do not. We believe that factors like the COVID-19 and sentiment factors are related to stock returns, but perhaps not linearly. We hope to do further work on relating factors such as COVID-19 to stock returns in the future.
For the market risk factor, the beta of an investment security (i.e., a stock) measures the volatility of its returns relative to the entire market. The p-values show that the beta for the market risk factor is significant in every regression, meaning the market risk factor fits the model very well. This is not surprising, as the market risk factor is verified by both the CAPM and the Fama-French model.
For the liquidity factor, about half of the regression p-values are not significant, and some exceed 0.9, meaning that for some stocks 'Liquidity' cannot explain the daily return well. For the other half, the performance of the factor is acceptable: their p-values are close to 0 (the white blocks in the heat map above). Since the results split roughly half and half, the performance of this factor may depend on the type of company.
For the HML factor, we find that in most cases the factor, which represents the growth premium of the whole market, is significant, showing that it affects the daily return of each stock in the market. Furthermore, positive coefficients slightly outnumber negative ones, 38 versus 32, indicating that a company's growth premium can explain its excess return, while the outbreak of COVID-19 widened the gap between growth and value companies.
For the news sentiment factor, most of the p-values are not significant. There may be many reasons: for example, the US stock market is dominated by institutional investors, who are more professional and less swayed by related news; another possible reason is that the news we collected carries some bias, as the mean compound VADER scores are mostly positive and even close to 1, showing a positive bias of the news toward the whole market.
For the COVID-19 factor, most of the p-values are also not significant, possibly for the same reasons as the news sentiment factor. Also, using the COVID-19 growth rate as a factor may cause a paradox: if confirmed cases increase by 60 thousand every day, investors may turn pessimistic about the stock market as the pandemic deteriorates, yet the growth rate itself will decrease because the denominator keeps growing. Thus, a better way to use the COVID-19 data may be needed.
[1] Fama, E. F., & French, K. R. (1993). Common risk factors in the returns on stocks and bonds. Journal of Financial Economics, 33(1), 3–56.
[2] Danyliv, O., Bland, B., & Nicholass, D. (2014). A Practical Approach to Liquidity Calculation. The Journal of Trading, 9(3), 57–65.
[3] Gopalan, R., Kadan, O., & Pevzner, M. (2009). Managerial Decisions, Asset Liquidity, and Stock Liquidity. SSRN Electronic Journal, 5–6. https://doi.org/10.2139/ssrn.1342706
[4] Understanding High Minus Low (HML). (n.d.). Investopedia. Retrieved December 12, 2020, from https://www.investopedia.com/terms/h/high_minus_low.asp
[5] Nasekin, S., & Chen, C. Y.-H. (2020). Deep learning-based cryptocurrency sentiment construction. Digital Finance, 2(1–2), 39–67. https://doi.org/10.1007/s42521-020-00018-y
[6] Ashraf, B. N. (2020). Stock markets’ reaction to COVID-19: Cases or fatalities? Research in International Business and Finance, 54, 101249. https://doi.org/10.1016/j.ribaf.2020.101249
[7] The widening gap between growth and value companies. (2020, August 7). FTAdviser. https://www.ftadviser.com/investments/2020/08/07/the-widening-gap-between-growth-and-value-companies/?page=1
[8] Hutto, C. J., & Gilbert, E. (2014). VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. In Eighth International Conference on Weblogs and Social Media (ICWSM-14). http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf